Many factors affect an NBA game. To understand the impact of each factor, three different data sets are used. This enables us to analyze each game at the player level as well as the team level and to check the rationality of certain “NBA Myths”. The following questions will guide our analysis.
Conclusions from an analysis are only as good as the data available. This project assumes that the data sets used are accurate reflections of the true historical NBA data.
In the data preparation step, columns were added to the shots dataset to make it easier to analyze. A subset of the shots data was also created to analyze the Warriors more closely.
An NBA court is 94 x 50 feet. The 3-point line is 23 feet 9 inches from the basket at the top of the arc (22 feet in the corners). The plot shows all made shots over the time span of our data set.
Are there any correlations between the x and y distances of their successful shots?
This shows where on the arc Steph Curry can make shots. 1. Does this imply that Stephen Curry can only make shots at the top of the arc? 2. Jimmy Butler makes a lot of his shots near the rim, so x and y are highly correlated.
## The Mean Shooting Distance for the League is 14.4783 feet
## sample size = 10 Mean = 14.4353 SD = 3.1277
## sample size = 20 Mean = 14.564 SD = 2.2072
## sample size = 30 Mean = 14.5105 SD = 1.8753
## sample size = 40 Mean = 14.4919 SD = 1.5672
The Central Limit Theorem states that as the sample size increases, the distribution of the sample mean approaches a normal distribution centered on the true population mean. Furthermore, the variance of the sample mean decreases in proportion to 1/n. As a result, we are more confident that our sample mean reflects the true population mean. Graphically, we see this: the spread of the histogram decreases as the sample size increases because of the smaller variance. The true mean shooting distance is 14.48 feet, and the gap between the population mean and the sample mean tends to shrink as the sample size increases.
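The shrinking spread can be reproduced with a small simulation. This is a minimal sketch in base R, not the code behind the output above: `population` is a hypothetical, right-skewed stand-in for the league's shot distances, and `sample_means` is an illustrative helper name.

```r
set.seed(42)
# Hypothetical population of shot distances in feet (assumption: stands in for the real shots data).
population <- rexp(10000, rate = 1 / 14.5)

# Draw `reps` samples of size n without replacement and return the mean of each sample.
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(sample(population, n)))
}

sizes <- c(10, 20, 30, 40)
results <- sapply(sizes, function(n) {
  m <- sample_means(n)
  # theory_sd is the standard error predicted by sigma / sqrt(n)
  c(n = n, mean = mean(m), sd = sd(m), theory_sd = sd(population) / sqrt(n))
})
print(round(t(results), 3))
```

Each row should show a mean close to the population mean and a standard deviation close to sigma / sqrt(n), which is why the histograms narrow as the sample size grows.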
Conclusion: 1. Steph Curry generates the most points, followed by Andrew Wiggins and Jordan Poole. He has a lower average of points made per attempt than other players, but this may be due to several reasons: facing more challenging looks (guarded shots), providing good assists to his teammates, or shooting from farther away.
There are many sampling methods, and sample results are used to estimate population characteristics. Employing several sampling methods, the following analysis aims to estimate the true shooting distance of the Warriors. Comparing these numbers with the league’s average shooting distance will help us quantify how strong a shooting team the Warriors are.
## The Mean Distance for the GS Warriors is 15.39 feet
## Mean from Stratified Sampling of the CSG is 14.34
## Mean from Sample Inclusion is 25.49
## Mean from Systematic Sampling is 15.15
## Mean from SRSWOR is 13.94
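For reference, three of the sampling schemes named above can be sketched in base R. This is an illustration under assumptions, not the code that produced the numbers above: the `shots` data frame is hypothetical, and `quarter` is an assumed stratification variable (the report does not state which variable was used).

```r
set.seed(1)
# Hypothetical Warriors shots: distance in feet, plus a quarter label used as the stratum.
shots <- data.frame(
  distance = round(runif(1000, 0, 30), 1),
  quarter  = sample(1:4, 1000, replace = TRUE)
)
N <- nrow(shots)
n <- 100

# Simple random sampling without replacement (SRSWOR).
srswor <- shots[sample(N, n), ]

# Systematic sampling: random start in 1..k, then every k-th row.
k <- N %/% n
start <- sample(k, 1)
systematic <- shots[seq(start, by = k, length.out = n), ]

# Stratified sampling with proportional allocation across quarters.
stratified <- do.call(rbind, lapply(split(shots, shots$quarter), function(g) {
  g[sample(nrow(g), round(n * nrow(g) / N)), ]
}))

# Compare each sample mean with the full-data mean.
sapply(list(all = shots, srswor = srswor, systematic = systematic,
            stratified = stratified), function(s) mean(s$distance))
```

Because of rounding in the proportional allocation, the stratified sample can differ from `n` by a row or two.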
Conclusion:
Is the wait time normally distributed? If the hot hand exists, the distribution of wait times should peak close to 0.
## Mean Wait Time for Curry = 1.78 Number of Shots
## Mean Wait Time for Durant = 1.2 Number of Shots
## Mean Wait Time for Jokic = 0.29 Number of Shots
##
## Kevin Durant Stephen Curry
## 2.00 5.25
Conclusions: Durant and Jokic have a shorter wait time until their next made shot compared to Curry. However, Curry takes more three-pointers. This means that a made shot by Durant or Jokic is typically worth 2 points, while Curry’s made shots can contribute 2 - 3 points.
It looks like Jokic has a short wait time until his next successful shot given that he has made a shot. However, this was based on 1 game for Jokic, explaining the white spaces between the blue bars. So, the hot hand myth is inconclusive given the lack of data.
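One way to compute a wait time is sketched below. It assumes the wait time is the number of missed shots between two consecutive made shots, which is consistent with means below 1 but is not confirmed by the report; `wait_times` and `made` are illustrative names.

```r
# made: vector of 1 (made) / 0 (missed) shots in chronological order.
# Assumption: wait time = number of misses between consecutive made shots.
wait_times <- function(made) {
  idx <- which(made == 1)  # positions of the made shots
  diff(idx) - 1            # misses between each pair of consecutive makes
}

made <- c(1, 1, 0, 1, 0, 0, 1, 1)  # toy sequence, not real data
waits <- wait_times(made)
print(waits)        # 0 1 2 0
print(mean(waits))  # 0.75
```

A player with a hot hand would show many wait times of 0, i.e. back-to-back makes.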
Stephen Curry averages 5.25 threes per game while Kevin Durant only averages 2 threes per game.
After examining the dataset, we found that it contains irrelevant records.
Columns that are irrelevant or contain duplicated information need to be filtered out.
The data for 2015 and 2016 are incomplete, so we focus on 2017 only.
fig <- plot_ly(merge_df, x = ~Wins_Home, y = ~Team, type = 'bar', orientation = 'h', name = 'HomeGameWin',
marker = list(color = 'rgba(246, 78, 139, 0.6)',
line = list(color = 'rgba(246, 78, 139, 1.0)',
width = 1)))
fig <- fig %>% add_trace(x = ~Losses_Home, name = 'HomeGameLoss',text = paste0(round(merge_df$HomeWinningPercentage,2)*100,"%"), textposition = 'outside',
marker = list(color = 'rgba(58, 71, 80, 0.6)',
line = list(color = 'rgba(58, 71, 80, 1.0)',
width = 1)))
fig <- fig %>% layout(barmode = 'stack',
title = "Number of Wins and Losses at Home Games for the NBA Season 2017 - 2018",
xaxis = list(title = "Number of Home Games"),
yaxis = list(title ="NBA Teams")
)
fig
cat(paste("On average, NBA teams have a ", round(mean(merge_df$HomeWinningPercentage) * 100,0),"%", " chance of winning a home game, giving them a slight advantage over their opponents.", sep =""))
## On average, NBA teams have a 58% chance of winning a home game, giving them a slight advantage over their opponents.
This is only one perspective. There may be many confounding variables, e.g. good teams win both at home and away, which masks the home court advantage if there is one. So we cannot make a conclusive statement.
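One way to reduce this particular confounder is to compare each team with itself, i.e. look at the per-team gap between home and away winning percentage. The sketch below uses a small made-up table (`teams`, with hypothetical values) purely to show the mechanics; it is not part of the original analysis.

```r
# Hypothetical per-team winning percentages (made-up values for illustration only).
teams <- data.frame(
  team     = c("A", "B", "C", "D", "E"),
  home_pct = c(0.80, 0.60, 0.55, 0.40, 0.30),
  away_pct = c(0.70, 0.45, 0.40, 0.30, 0.15)
)

# Per-team home edge: strong and weak teams are each compared with themselves.
teams$home_edge <- teams$home_pct - teams$away_pct
print(mean(teams$home_edge))

# Paired t-test on the within-team differences.
res <- t.test(teams$home_pct, teams$away_pct, paired = TRUE)
print(res$p.value)
```

With the real data, the same idea would apply to `HomeWinningPercentage` and `AwayWinningPercentage` in `merge_df`.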
location_df = nba
row.names(location_df) <- 1:nrow(location_df)
#ploting location of
library(maps)
data(us.cities)
remove_cities = c('Cleveland OH', 'North Atlanta GA', 'West New York NY', 'North Miami Beach FL', 'North Miami FL', 'Portland ME', 'Port Charlotte FL', 'North Las Vegas NV', 'Kansas City KS', 'Seattle Hill-Silver Firs WA', 'South San Francisco CA', 'West Sacramento CA', 'East Los Angeles CA', 'Miami Beach FL')
us.cities = subset(us.cities, !us.cities$name %in% remove_cities)
location_df = nba[, c('seasonYear','hTeam.nickName','vTeam.nickName', 'arena', 'city')]
idx2 <- sapply(location_df$city, grep, us.cities$name)
idx1 <- sapply(seq_along(idx2), function(i) rep(i, length(idx2[[i]])))
location_df = cbind(location_df[unlist(idx1),,drop=F], us.cities[unlist(idx2),,drop=F])
#graphical analysis
graph_loc = unique(location_df[c("city", "name","arena", "lat", "long")])
arena_wins = nba %>% group_by(arena, teamwon, hTeam.nickName
) %>% filter(teamwon == hTeam.nickName) %>% summarise(Wins_in_Arena= n())
geo_wins = merge(x = graph_loc, y = arena_wins, by= "arena", all.x = TRUE)
#GRAPHING WINS BY AREA ON MAP
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showland = TRUE,
landcolor = toRGB("gray95"),
subunitwidth = 1,
countrywidth = 1,
subunitcolor = toRGB("white"),
countrycolor = toRGB("white")
)
fig <- plot_geo(geo_wins, lat = ~lat, lon = ~long)
fig <- fig %>% add_markers(
text = ~paste(arena, city,sep = "<br />"),
size = ~Wins_in_Arena, hoverinfo = "text"
)
fig <- fig %>% colorbar(title = "Games")
fig <- fig %>% layout(
title = 'US Map of NBA locations', geo = g
)
fig
To reduce the influence of other factors on how teams perform in different locations, Denver is the primary focus. The objective is to calculate the difference between each team’s chance of winning away games in general and its chance of winning at Denver.
# what's this team's winning percentage at denver?
denver = subset(nba, arena %in% "Pepsi Center")
teamwon = denver %>% group_by(teamwon) %>% summarise(count = n())
teamvisitng = denver %>% group_by(vTeam.nickName) %>% summarise(count = n())
colnames(teamvisitng) = c("team", "num_games")
colnames(teamwon) = c("team", "num_games_won")
winningatdenver = merge(x = teamvisitng, y = teamwon, by="team" , all.x = TRUE)
winningatdenver[is.na(winningatdenver)] = 0
winningatdenver$winning_percentage = winningatdenver$num_games_won/winningatdenver$num_games
cat(str_c("League's likelihood of winning at Denver is ", round(mean(winningatdenver$winning_percentage),4)*100, "%", "\n", "League's likelihood of winning at Away Game is ", round(mean(merge_df$AwayWinningPercentage),4)*100,"%", "\n",
"SD is ", round(sd(merge_df$AwayWinningPercentage),4)*100,"%"))
## League's likelihood of winning at Denver is 30.46%
## League's likelihood of winning at Away Game is 40.9%
## SD is 14.5%
Even though teams are less likely to win at Denver, 30.46% is within 1 standard deviation of 40.9% (+/- 14.5%). This means we cannot draw any conclusions with high confidence.
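The "within 1 standard deviation" check can be written out as a short calculation using the three numbers printed above. The variable names are illustrative.

```r
# Values taken from the printed output above (in percent).
denver_pct <- 30.46  # league's winning percentage at Denver
away_mean  <- 40.9   # league's mean away winning percentage
away_sd    <- 14.5   # standard deviation of away winning percentage

# Number of standard deviations between the Denver figure and the away mean.
z <- (denver_pct - away_mean) / away_sd
print(round(z, 2))  # -0.72
print(abs(z) < 1)   # TRUE: within one standard deviation
```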
fig = plot_ly(type = 'box')
fig = fig %>% add_boxplot(y = merge_df$HomeWinningPercentage, quartilemethod="linear", name="% of Winning at Home Games",
jitter = 0.3, pointpos = -1.8, boxpoints = 'all')
fig = fig %>% add_boxplot(y = merge_df$AwayWinningPercentage, quartilemethod="linear", name="% of Winning at Away Games",
jitter = 0.3, pointpos = -1.8, boxpoints = 'all')
fig = fig %>% layout(title = list(text = "Distribution of Winning Percentage for Teams"),
yaxis = list(range=c(0,1)
)
)
fig
This supports the idea that there is high variance in winning percentage across teams.
# Warriors performance at home games vs. away games:
warriors = nba[ (nba$hTeam.nickName == "Warriors" ) | (nba$vTeam.nickName == "Warriors"), ]
#warriors performance at home games:
warriorshome <- warriors[which(warriors$hTeam.nickName == "Warriors"),]
homedensity <- density(warriorshome$hTeam.score.points)
#warriors performance at away games:
warriorsaway = warriors[which(warriors$vTeam.nickName == "Warriors"), ]
awaydensity = density(warriorsaway$vTeam.score.points)
# helper: vertical dashed line spanning the full plot height at position x
vline <- function(x = 0, color = "grey") {
  list(
    type = "line",
    y0 = 0,
    y1 = 1,
    yref = "paper",
    x0 = x,
    x1 = x,
    opacity = 0.6,
    line = list(color = color, dash = "dash")
  )
}
fig <- plot_ly(x = ~homedensity$x, y = ~homedensity$y, type = 'scatter', mode = 'lines', name = 'Points Made at Home Games', fill = 'tozeroy') %>% layout(shapes = list(vline(median(homedensity$x)), vline(median(awaydensity$x))))
fig <- fig %>% add_trace(x = ~awaydensity$x, y = ~awaydensity$y, name = 'Points Made at Away Games', fill = 'tozeroy')
fig <- fig %>% layout(xaxis = list(title = 'Points Scored'),
yaxis = list(title = 'Density'),
title = list(text = "Warriors: Points Distribution at Home vs Away Games between 2015 - 2020", y = 0.95))
fig
cat(str_c("The median of the GSW Total Score at Away Games is " , median(awaydensity$x), ". \n", "The median of the GSW Total Score at Home Games is ", median(homedensity$x), ". \n"))
## The median of the GSW Total Score at Away Games is 108.
## The median of the GSW Total Score at Home Games is 112.
The Warriors’ scoring ability doesn’t seem to be significantly affected by playing away. There is only a 4-point difference between home games (112) and away games (108).
This dataset is used to compare Steph Curry’s performance during regular season games vs playoff games. For a fair comparison, the data is filtered to the seasons in which the Warriors made the playoffs.
# What is Curry's shooting percentage during regular season - including fg and ft
regular = seasons_in_playoff[which(seasons_in_playoff$Type == "REGULAR SEASON STATS") ,]
shooting_percentage = regular$`Successful Shots`/regular$`Total Shots`
dates = regular$Dates
regdf = data.frame(dates, shooting_percentage)
# What is Curry's shooting percentage during conference and finals
postseason = subset(seasons_in_playoff, !(seasons_in_playoff$Type %in% "REGULAR SEASON STATS"))
shooting_percentage = postseason$`Successful Shots`/postseason$`Total Shots`
dates = postseason$Dates
psdf = data.frame(dates, shooting_percentage)
##
#summary(regdf$shooting_percentage)
#summary(psdf$shooting_percentage)
cat(paste("Regular season shooting percentage SD:", sd(regdf$shooting_percentage), ". \n",
"Post season shooting percentage SD: ", sd(psdf$shooting_percentage), ". \n",
"since the sd is relatively the same, a t-test is applied. "))
## Regular season shooting percentage SD: 0.117206780167678 .
## Post season shooting percentage SD: 0.105224210827332 .
## since the sd is relatively the same, a t-test is applied.
# T- test psdf and regdf
t.test(regdf$shooting_percentage, psdf$shooting_percentage)
##
## Welch Two Sample t-test
##
## data: regdf$shooting_percentage and psdf$shooting_percentage
## t = 1.7141, df = 136.98, p-value = 0.08877
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.00326292 0.04574416
## sample estimates:
## mean of x mean of y
## 0.4739225 0.4526819
cat(paste("H0: µ1 = µ2 (psdf and regdf means are equal)", "\n","HA: µ1 ≠ µ2 (psdf and regdf means are not equal)"))
## H0: µ1 = µ2 (psdf and regdf means are equal)
## HA: µ1 ≠ µ2 (psdf and regdf means are not equal)
Conclusion: The Welch two-sample t-test gives a p-value of 0.089, which is above the 0.05 significance level, so we fail to reject H0. There is no statistically significant difference in shooting percentage between regular season games and playoff games for Steph Curry.
fig <- plot_ly(x=~regdf$dates, y=~regdf$shooting_percentage, type = 'scatter', name = "regular season games")
fig <- fig %>% add_trace(x=~psdf$dates, y =~psdf$shooting_percentage, name = "playoff games") %>%layout(title = "Shoot Percentage by Dates", yaxis = list(title ="Shooting Percentage"), xaxis = list(title ="Game Dates"))
fig
The spread of Steph Curry’s shooting percentage is relatively similar between regular season games and playoff games. However, there is slightly more variance in the regular season games.
regular$`Score GS` = as.numeric(regular$`Score GS`)
r = regular$Result
x = ifelse(r == "W", 1, 0)  # encode the result: 1 for a win, 0 for a loss
stats = regular[,c("Score GS", "3 Points Succesful","PTS", "REB", "AST", "BLK", "STL","TO", "Minutes")]
stats$Result_Encoded = x
regcorr = cor(stats)
#postseason
postseason$`Score GS` = as.numeric(postseason$`Score GS`)
r = postseason$Result
x = ifelse(r == "W", 1, 0)  # encode the result: 1 for a win, 0 for a loss
psstats = postseason[,c("Score GS", "3 Points Succesful","PTS", "REB", "AST", "BLK", "STL","TO", "Minutes")]
psstats$Result_Encoded = x
pscorr =cor(psstats)
fig1 <- plot_ly(x=colnames(regcorr), y=rownames(regcorr), z = regcorr, type = "heatmap", colors = c("cyan", "blue")) %>%
layout(margin = list(l=120))
fig2 <- plot_ly(x=colnames(pscorr), y=rownames(pscorr), z = pscorr, type = "heatmap", colors = c("cyan", "blue")) %>%
layout(margin = list(l=120))
fig <- subplot(fig1, fig2, nrows = 2, margin = 0.07) %>% layout(title = "Correlation HeatMap Reg vs Post")
fig
Result_Encoded = 1 if Win, 0 if Loss. The correlation heatmap did not reveal any strong correlation between winning and any other factor. The main takeaway is that wins are positively correlated with the Warriors’ team score.
Small correlations to point out: 1. Wins are positively correlated with Steph Curry’s points and rebounds. 2. Wins are negatively correlated with Steph Curry’s turnovers and playing time. A topic for further analysis is whether Steph is more likely to make mistakes the longer he plays, due to exhaustion and loss of focus.
xreg = seq(15, 60, by = 5)
p_win_reg = c()
for(i in xreg){
testdf = subset(stats, (stats$Minutes >= i) & (stats$Minutes < i+5))
probwins = sum(testdf$Result_Encoded)/nrow(testdf)
p_win_reg = c(p_win_reg, probwins)
}
xps = seq(15, 60, by = 5)
p_win = c()
for(i in xps){
testdf = subset(psstats, (psstats$Minutes >= i) & (psstats$Minutes < i+5))
probwins = sum(testdf$Result_Encoded)/nrow(testdf)
p_win = c(p_win, probwins)
}
fig = plot_ly(x = xps, y = p_win, type = "scatter", name = "playoffs")
fig = fig %>% add_trace(x=xreg, y= p_win_reg, name = "regular seasons") %>% layout(title = "Chance of Win vs Total Game Time", yaxis = list(title = "Probability of Winning"), xaxis = list(title = "Total Game Time"))
fig
There appears to be a negative linear relationship between Steph Curry’s total time played and the Warriors’ chance of winning. This may be caused by the depth of the Warriors’ roster: when the team has no depth and players are injured, Steph Curry has to play more minutes, which decreases the team’s chance of winning.
Many factors affect a team’s performance. In examining the questions above, we attempt to verify the legitimacy of several NBA Myths.
The hot hand doesn’t appear to apply to Steph Curry or Kevin Durant. They appear to be consistent players who make a successful shot every 1-2 attempts on average. For Jokic, there was not enough data to conclusively state that he is a streaky player. He could just be taking easy shots and making smarter decisions. The x and y coordinates of Jimmy Butler’s shots appear to be highly correlated. A plausible explanation is that he is a put-back player or a dunker at the rim, not a shooter, so he is really close to the basket.
The home team does appear to have a slight advantage. Furthermore, teams appear to have had a challenging time winning at Denver in the 2017 season. However, since the variance associated with home court advantage is very high, we cannot conclude that it is valid. Many confounding variables are at play, e.g. some teams are more dominant given their makeup of players.
Steph Curry is known as an elite player. In the 5 years that he made the playoffs, there is no significant difference between his performance during the regular season and the postseason. (Granted, he won MVP in two of those seasons, proving that he was playing well during the regular season.) There appears to be a negative correlation between the Warriors’ chance of winning and Steph Curry’s playing time. Further analysis is needed before making any conclusive statements. Could this be attributed to player exhaustion and loss of focus during long games? Maybe there is no depth in the team roster and other star players are injured or out?